Check for duplicate files

I've seen "duplicate" scenarios before, and even a bunch of Mac apps that help find and/or get rid of duplicate files on the hard disk in order to gain some space. That is not my case this time, though, and by the way, this is NOT a weird-o case. I'm sure many of us find ourselves in a similar situation every once in a while.

So please, keep on reading.

I have (1) a small folder holding hundreds of single files with no sub-folders in it, and (2) a big folder holding thousands of files, folders, subfolders and more files... arranged in a complex directory scheme.


Most of the files in folder (1) have a copy in folder (2), and that is OK. In fact, what I need is to make sure that ALL files in folder (1) have an exact copy in folder (2), regardless of where that copy sits in the hierarchy.

I can go through the process of checking each file, one by one, and, once I find its copy in folder (2), delete it from folder (1). That way I will end up with a tiny little folder (1) holding only the few files whose copies couldn't be found in folder (2).

Is there any way to automate the process? Using Automator, perhaps? Do you know an app that can help me achieve that?

Thanks a lot!

Posted on Jun 21, 2016 11:53 AM


Jun 24, 2016 10:58 AM in response to gefaria

As I said before... the devil is in the details 🙂


30,000 is a lot of files to have in one directory. From what you said I expected it to be large, which is why I opted for a find in the shell rather than the Finder, but even then there are limitations.


That said, since you're still getting errors, the simplest solution would be to break the script down into a loop that processes a subset of the files each time, something like replacing the line:



set folder1Files to every file of folder1


with:


repeat with firstChar in {"a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z"}
    set folder1Files to (every file of folder1 whose name begins with firstChar)
    ...


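(with a corresponding 'end repeat' at the end). This will break the list of 30,000 files into (hopefully) more manageable chunks based on the first character. File names that start with a digit or any other character not in the list are skipped, so add those characters if you need them. You can change the loop to break on any other criterion you like.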
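Before settling on the chunks, it may help to see how the file names actually spread across first characters. The following is only a rough shell sketch for Terminal, not part of the script above; first_char_counts is a made-up helper name, and it assumes file names contain no newlines. It lowercases the first character because AppleScript string comparisons ignore case by default.

```shell
#!/bin/sh
# Hypothetical helper: count the regular files in a flat folder,
# grouped by the (lowercased) first character of their names.
first_char_counts() {
    for f in "$1"/*; do
        [ -f "$f" ] || continue          # skip sub-folders and the like
        printf '%s\n' "${f##*/}"         # print the name without its path
    done | cut -c1 | tr '[:upper:]' '[:lower:]' | sort | uniq -c
}
```

Calling first_char_counts with the path of folder1 prints one line per first character, each with its count. Any character that shows up outside a to z needs its own entry in the repeat list, and a very large count is a hint that the chunk should be split further.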

Jun 21, 2016 11:04 PM in response to gefaria

Conceptually this is pretty straightforward - a little AppleScript, or maybe shell script and you're done.


As always with these kinds of questions, though, the devil is in the details. In this case, what - specifically - constitutes a match? Are files with the same name considered a match? What if one is newer/older/larger/smaller than the other? Are they still considered a match? Which one should you keep? The newest? The biggest?
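To make the strictest definition concrete: if a "match" has to mean an exact copy, the contents can be compared byte for byte with cmp. This is just a sketch of that one option, assuming you already have two candidate paths in hand; is_exact_copy is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) only when both files
# hold exactly the same bytes. Names and dates are not considered.
is_exact_copy() {
    # -s keeps cmp silent; only its exit status is used
    cmp -s "$1" "$2"
}
```

It would be used as: if is_exact_copy "$fileA" "$fileB"; then ... fi. Comparing contents reads both files in full, so it is slower than checking names or sizes alone.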


Once you clarify that, the rest should be easy.

Jun 23, 2016 1:00 AM in response to gefaria

The following script (minimally tested!) should do what you want. Copy the script into a new Script Editor document and run it. (The delete command is actually a misnomer: it only moves the files to the Trash, so there's still a chance of recovery.)


set folder1 to (choose folder with prompt "Please select the folder to be cleaned")
set folder2 to (choose folder with prompt "Please select the folder to compare")

-- fast way to get a list of files in a directory
set fileList to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -exec basename {} \\;"
set fileList to paragraphs of fileList

tell application "Finder"
    set folder1Files to every file of folder1
    repeat with eachFile in folder1Files
        set fName to name of eachFile
        if fName is in fileList then
            -- we have a filename match
            set f1Size to size of eachFile as integer
            set matchingf2File to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -name " & quoted form of fName & " -size " & f1Size & "c"
            if matchingf2File is not "" then
                -- we have a duplicate, so:
                delete eachFile
            end if
        end if
    end repeat
end tell


It probably needs some explanation...


It starts off by prompting for two folders - the first should be the one that contains the flat directory that you want to clean up (folder1). The second should be the one with the hierarchal directories you want to search in (folder2).


set folder1 to (choose folder with prompt "Please select the folder to be cleaned")
set folder2 to (choose folder with prompt "Please select the folder to compare")


Then it uses a shell command find to get a listing of all the files in folder2. I do this because the Finder is notoriously slow in traversing large directory trees, so even though using the Finder would be simpler, it would be much slower.


set fileList to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -exec basename {} \\;"
set fileList to paragraphs of fileList


Now I iterate through the files in folder1, checking the name of the file against the cached list of files in folder2.


set folder1Files to every file of folder1
repeat with eachFile in folder1Files
    set fName to name of eachFile
    if fName is in fileList then

If there are no matches I move on to the next file, otherwise we at least have a file that has the same name, so we need to check its size.


set f1Size to size of eachFile as integer
set matchingf2File to do shell script "/usr/bin/find " & quoted form of POSIX path of folder2 & " -type f -name " & quoted form of fName & " -size " & f1Size & "c"


Here I use another shell trick. I first get the size of the current file. I then perform another find to look for a file that has the same name and the same size as the current file. If I get back an empty result I know the file sizes are different, so I leave the file alone, but if the file sizes match I know it's safe to delete the file.


I know this may sound convoluted, but for a large directory tree, with a large number of files in folder1, it would be cumbersome/unwieldy to perform a full depth traversal of folder2 for every file, so I first cache the list of file names and just do a secondary search for those that have matching filenames.
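For anyone who would rather try the idea straight from Terminal, here is a rough shell sketch of the same two stages: cache the names once, then run a second find only for names that are in the cache. It is not the script above. list_duplicates is a made-up name, it only prints candidates and deletes nothing, and it assumes file names contain no newlines and no find pattern characters such as *, ? or [.

```shell
#!/bin/sh
# Sketch: print the files in a flat folder that have a same-name,
# same-size file somewhere under a second folder. Nothing is deleted.
list_duplicates() {
    flat_dir=$1
    tree_dir=$2
    names=$(mktemp "${TMPDIR:-/tmp}/names.XXXXXX") || return 1

    # Stage 1: one traversal of the big tree, caching every base name.
    find "$tree_dir" -type f -exec basename {} \; > "$names"

    # Stage 2: a second find only for names that appear in the cache.
    for f in "$flat_dir"/*; do
        [ -f "$f" ] || continue
        name=${f##*/}
        grep -Fxq -- "$name" "$names" || continue
        size=$(($(wc -c < "$f")))   # byte count; arithmetic strips padding
        match=$(find "$tree_dir" -type f -name "$name" -size "${size}c")
        if [ -n "$match" ]; then
            printf '%s\n' "$f"
        fi
    done
    rm -f "$names"
}
```

Running list_duplicates with the two folder paths gives a list you can review before moving anything to the Trash. As in the script above, a match here means same name and same byte size, not identical contents.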

Jun 23, 2016 1:10 AM in response to Camelot

Camelot,


Thank you very VERY much for your kindness and for the time you invested in such altruistic help. I hope this little script helps not only me, but other people as well. I'm running it as I write; according to my calculations it will take about a week to finish (since I hear the little trash sound every time it deletes a file), but it is obviously much better than doing it manually.


Please contact me through my web site if there is anything you need, I owe you one, a BIG one,


GF

manoDerecha

www.manoderecha.es

Jun 24, 2016 7:21 AM in response to Camelot

Camelot,


I'm sad to tell you that the script is not working any more 😟 (at least on my Mac, with the folders I need it to work on).


It ran fine (very slowly) the first time, then crashed after an hour or two, and I restarted it a couple of times after that. Last night I left it running and found it this morning with a "time out" error. Maybe it is due to the fact that both folders are enormous (folder 1 being almost 3 GB with nearly 30,000 files, folder 2 being more than ten times that), or to some issue like a memory leak, I don't know.


The only solution I'm able to think of is to re-run it after restarting my Mac, and that isn't helping. It works great with smaller folders, though; I tested it a little bit.


Any suggestion?

